In this document, I am hoping to mostly copy/paste material from the tests/ tree and explain the various functionalities therein. It is my hope therefore to step from data loading all the way through ontology searching with appropriate visualizations at each stage.
In test_01load_data.R I perform load some data into an expressionset and get ready to play with it.
## I use sm to keep functions from printing too much (well, anything really)
tt <- sm(library(hpgltools))
tt <- sm(library(pasilla))
tt <- sm(data(pasillaGenes))
biomart is an excellent resource for annotation data, but it is entirely too complex. The following function ‘get_biomart_annotations()’ attempts to make that relatively simple.
## Try loading some annotation information for this species.
gene_info_lst <- sm(load_biomart_annotations(species = "dmelanogaster",
host = "useast.ensembl.org"))
gene_info <- gene_info_lst[["annotation"]]
info_idx <- gene_info[["gene_biotype"]] == "protein_coding"
gene_info <- gene_info[info_idx, ]
rownames(gene_info) <- make.names(gene_info[["ensembl_gene_id"]], unique = TRUE)
head(gene_info)
## ensembl_transcript_id ensembl_gene_id description
## FBgn0260439 FBtr0005088 FBgn0260439 Protein phosphatase 2A at 29B
## FBgn0000056 FBtr0006151 FBgn0000056 Adh-related
## FBgn0031081 FBtr0070000 FBgn0031081 Neprilysin 3
## FBgn0031085 FBtr0070002 FBgn0031085
## FBgn0062565 FBtr0070003 FBgn0062565 Odorant receptor 19b
## FBgn0031089 FBtr0070006 FBgn0031089
## gene_biotype cds_length chromosome_name strand start_position
## FBgn0260439 protein_coding 1776 2L + 8366038
## FBgn0000056 protein_coding 819 2L + 14615552
## FBgn0031081 protein_coding 2361 X + 19961297
## FBgn0031085 protein_coding 633 X + 20051294
## FBgn0062565 protein_coding 1164 X + 20094398
## FBgn0031089 protein_coding 1326 X + 20148124
## end_position
## FBgn0260439 8370090
## FBgn0000056 14618902
## FBgn0031081 19969323
## FBgn0031085 20052519
## FBgn0062565 20095767
## FBgn0031089 20155514
The pasilla data set provides count tables in a tab separated file, let us read them into an expressionset in the following block along with creating an experimental design. create_expt() will then merge the annotations, experimental design, and count tables into an expressionset.
## This section is copy/pasted to all of these tests, that is dumb.
datafile <- system.file("extdata/pasilla_gene_counts.tsv", package = "pasilla")
## Load the counts and drop super-low counts genes
counts <- read.table(datafile, header = TRUE, row.names = 1)
counts <- counts[rowSums(counts) > ncol(counts),]
## Set up a quick design to be used by cbcbSEQ and hpgltools
design <- data.frame(row.names = colnames(counts),
condition = c("untreated","untreated","untreated",
"untreated","treated","treated","treated"),
libType = c("single_end","single_end","paired_end",
"paired_end","single_end","paired_end","paired_end"))
metadata <- design
colnames(metadata) <- c("condition", "batch")
metadata[["sampleid"]] <- rownames(metadata)
## Make sure it is still possible to create an expt
pasilla_expt <- sm(create_expt(count_dataframe = counts, metadata = metadata,
savefile = "pasilla", gene_info = gene_info))
In this block I will use a single function graph_metrics() to plot them all. And then follow up with the one at a time. Many functions in hpgltools are quite chatty with liberal usage of message(), as a result I will sm() this call to silence it.
pasilla_metrics <- sm(graph_metrics(pasilla_expt, ma = TRUE, qq = TRUE))
summary(pasilla_metrics)
## Length Class Mode
## boxplot 9 gg list
## corheat 3 recordedplot list
## cvplot 9 gg list
## density 10 gg list
## density_table 5 data.table list
## disheat 3 recordedplot list
## gene_heatmap 0 -none- NULL
## legend 3 recordedplot list
## legend_colors 3 data.frame list
## libsize 9 gg list
## libsizes 4 data.table list
## libsize_summary 7 data.table list
## ma 21 -none- list
## nonzero 9 gg list
## nonzero_table 7 data.frame list
## pc_loadplot 3 recordedplot list
## pc_summary 4 data.frame list
## pc_propvar 6 -none- numeric
## pc_plot 9 gg list
## pc_table 14 data.frame list
## qqlog 3 recordedplot list
## qqrat 3 recordedplot list
## smc 9 gg list
## smd 9 gg list
## topnplot 9 gg list
## tsne_summary 4 data.frame list
## tsne_propvar 20 -none- numeric
## tsne_plot 9 gg list
## tsne_table 10 data.frame list
Print some plots!
pasilla_metrics$libsize
## The library sizes range from 8-21 million reads, this might be a problem for
## some analyses, but it should be ok
pasilla_metrics$nonzero
## Ergo, the lower abundance libraries have more genes of counts == 0 (bottom
## left).
pasilla_metrics$boxplot
## And a boxplot downshifts them (but not that much because it decided to put
## the data on the log scale).
pasilla_metrics$density
## Similarly, one can see those samples are a bit lower with respect to density
## Unless the data is very well behaved, the rest of the plots are not likely to
## look good until the data is normalized, nonetheless, lets see
pasilla_metrics$corheat
pasilla_metrics$disheat
pasilla_metrics$pc_plot
## So the above 3 plots are pretty much the worst case scenario for this data.
The most common normalization suggested by Najib is a cpm(quantile(filter(data))). On top of that we often do log2() and/or a batch adjustment. default_norm() does the first and may be supplemented with other arguments.
norm <- default_norm(pasilla_expt, transform = "log2")
## Removing 2622 low-count genes (7531 remaining).
norm_metrics <- graph_metrics(norm)
## Graphing number of non-zero genes with respect to CPM by library.
## Graphing library sizes.
## Graphing a boxplot.
## Graphing a correlation heatmap.
## Graphing a standard median correlation.
## Performing correlation.
## Graphing a distance heatmap.
## Graphing a standard median distance.
## Performing distance.
## Graphing a PCA plot.
## Graphing a T-SNE plot.
## Plotting a density plot.
## Plotting a CV plot.
## Naively calculating coefficient of variation/dispersion with respect to condition.
## Finished calculating dispersion estimates.
## Plotting the representation of the top-n genes.
## Plotting the expression of the top-n PC loaded genes.
## Printing a color to condition legend.
norm_metrics$corheat
norm_metrics$smc
norm_metrics$disheat
norm_metrics$smd
## some samples look a little troublesome here.
norm_metrics$pc_plot
With the above metrics in mind, we may perform a pairwise comparison of the data. By default, all_pairwise() performs every possible pairwise contrast, which in the case is comprised of just treated vs. untreated.
pasilla_pairwise <- sm(all_pairwise(pasilla_expt))
pasilla_tables <- sm(combine_de_tables(
pasilla_pairwise,
excel = "pasilla_tables.xlsx"))
pasilla_sig <- sm(extract_significant_genes(
pasilla_tables,
excel = "pasilla_sig.xlsx"))
pasilla_ab <- sm(extract_abundant_genes(
pasilla_pairwise,
excel = "pasilla_abundant.xlsx"))
pasilla_tables[["plots"]][["untreated_vs_treated"]][["deseq_ma_plots"]]$plot
pasilla_tables[["plots"]][["untreated_vs_treated"]][["edger_ma_plots"]]$plot
pasilla_tables[["plots"]][["untreated_vs_treated"]][["limma_ma_plots"]]$plot
up_genes <- pasilla_sig[["deseq"]][["ups"]][["untreated_vs_treated"]]
down_genes <- pasilla_sig[["deseq"]][["downs"]][["untreated_vs_treated"]]
pasilla_go <- load_biomart_go(species = "dmelanogaster")$go
## Using mart: ENSEMBL_MART_ENSEMBL from host: Apr2020.archive.ensembl.org.
## Finished downloading ensembl go annotations, saving to dmelanogaster_go_annotations.rda.
## Saving ontologies to dmelanogaster_go_annotations.rda.
## Finished save().
pasilla_length <- fData(pasilla_expt)[, c("ensembl_gene_id", "cds_length")]
colnames(pasilla_length) <- c("ID", "length")
pasilla_up_goseq <- simple_goseq(sig_genes = up_genes, go_db = pasilla_go,
length_db = pasilla_length)
## Found 103 go_db genes and 102 length_db genes out of 113.
## Testing that go categories are defined.
## Removing undefined categories.
## Gathering synonyms.
## Gathering category definitions.
pasilla_up_goseq[["pvalue_plots"]][["bpp_plot_over"]]
pasilla_down_goseq <- simple_goseq(sig_genes = down_genes, go_db = pasilla_go,
length_db = pasilla_length)
## Found 97 go_db genes and 97 length_db genes out of 109.
## Testing that go categories are defined.
## Removing undefined categories.
## Gathering synonyms.
## Gathering category definitions.
pasilla_down_goseq[["pvalue_plots"]][["bpp_plot_over"]]
high_genes <- names(pasilla_ab[["abundances"]][["deseq"]][["high"]][["treated"]])
pasilla_high_goseq <- simple_goseq(sig_genes = high_genes, go_db = pasilla_go,
length_db = pasilla_length)
## Found 182 go_db genes and 181 length_db genes out of 200.
## Testing that go categories are defined.
## Removing undefined categories.
## Gathering synonyms.
## Gathering category definitions.
pasilla_high_goseq[["pvalue_plots"]][["bpp_plot_over"]]
low_genes <- names(pasilla_ab[["abundances"]][["deseq"]][["low"]][["treated"]])
pasilla_low_goseq <- simple_goseq(sig_genes = low_genes, go_db = pasilla_go,
length_db = pasilla_length)
## Found 175 go_db genes and 171 length_db genes out of 200.
## Testing that go categories are defined.
## Removing undefined categories.
## Gathering synonyms.
## Gathering category definitions.
pasilla_low_goseq[["pvalue_plots"]][["bpp_plot_over"]]
pander::pander(sessionInfo())
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
locale: LC_CTYPE=en_US.UTF-8, LC_NUMERIC=C, LC_TIME=en_US.UTF-8, LC_COLLATE=en_US.UTF-8, LC_MONETARY=en_US.UTF-8, LC_MESSAGES=en_US.UTF-8, LC_PAPER=en_US.UTF-8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.UTF-8 and LC_IDENTIFICATION=C
attached base packages: splines, stats4, parallel, stats, graphics, grDevices, utils, datasets, methods and base
other attached packages: pasilla(v.1.18.1), GO.db(v.3.12.1), AnnotationDbi(v.1.52.0), GOstats(v.2.56.0), edgeR(v.3.32.1), lme4(v.1.1-26), Matrix(v.1.3-2), BiocParallel(v.1.24.1), variancePartition(v.1.20.0), fission(v.1.10.0), ruv(v.0.9.7.1), SummarizedExperiment(v.1.20.0), GenomicRanges(v.1.42.0), GenomeInfoDb(v.1.26.2), IRanges(v.2.24.1), S4Vectors(v.0.28.1), MatrixGenerics(v.1.2.1), matrixStats(v.0.58.0), hpgltools(v.1.0), R6(v.2.5.0), Biobase(v.2.50.0) and BiocGenerics(v.0.36.0)
loaded via a namespace (and not attached): utf8(v.1.1.4), R.utils(v.2.10.1), tidyselect(v.1.1.0), RSQLite(v.2.2.3), htmlwidgets(v.1.5.3), grid(v.4.0.3), Rtsne(v.0.15), IHW(v.1.18.0), munsell(v.0.5.0), codetools(v.0.2-18), preprocessCore(v.1.52.1), statmod(v.1.4.35), withr(v.2.4.1), colorspace(v.2.0-0), Category(v.2.56.0), highr(v.0.8), knitr(v.1.31), rstudioapi(v.0.13), Vennerable(v.3.1.0.9000), robustbase(v.0.93-7), genoPlotR(v.0.8.11), labeling(v.0.4.2), slam(v.0.1-48), GenomeInfoDbData(v.1.2.4), lpsymphony(v.1.18.0), topGO(v.2.42.0), bit64(v.4.0.5), farver(v.2.0.3), rprojroot(v.2.0.2), vctrs(v.0.3.6), generics(v.0.1.0), xfun(v.0.21), BiocFileCache(v.1.14.0), fastcluster(v.1.1.25), doParallel(v.1.0.16), locfit(v.1.5-9.4), bitops(v.1.0-6), cachem(v.1.0.4), DelayedArray(v.0.16.1), assertthat(v.0.2.1), scales(v.1.1.1), gtable(v.0.3.0), affy(v.1.68.0), sva(v.3.38.0), rlang(v.0.4.10), genefilter(v.1.72.1), rtracklayer(v.1.50.0), lazyeval(v.0.2.2), selectr(v.0.4-2), broom(v.0.7.5), BiocManager(v.1.30.10), yaml(v.2.2.1), reshape2(v.1.4.4), GenomicFeatures(v.1.42.1), crosstalk(v.1.1.1), backports(v.1.2.1), qvalue(v.2.22.0), RBGL(v.1.66.0), tools(v.4.0.3), ggplot2(v.3.3.3), affyio(v.1.60.0), ellipsis(v.0.3.1), gplots(v.3.1.1), jquerylib(v.0.1.3), RColorBrewer(v.1.1-2), blockmodeling(v.1.0.0), Rcpp(v.1.0.6), plyr(v.1.8.6), progress(v.1.2.2), zlibbioc(v.1.36.0), purrr(v.0.3.4), RCurl(v.1.98-1.2), BiasedUrn(v.1.07), ps(v.1.5.0), prettyunits(v.1.1.1), openssl(v.1.4.3), ggrepel(v.0.9.1), colorRamps(v.2.3), magrittr(v.2.0.1), data.table(v.1.14.0), openxlsx(v.4.2.3), SparseM(v.1.81), goseq(v.1.42.0), pkgload(v.1.2.0), hms(v.1.0.0), evaluate(v.0.14), xtable(v.1.8-4), pbkrtest(v.0.5-0.1), XML(v.3.99-0.5), gridExtra(v.2.3), testthat(v.3.0.2), compiler(v.4.0.3), biomaRt(v.2.46.3), tibble(v.3.0.6), KernSmooth(v.2.23-18), crayon(v.1.4.1), minqa(v.1.2.4), R.oo(v.1.24.0), htmltools(v.0.5.1.1), mgcv(v.1.8-34), corpcor(v.1.6.9), tidyr(v.1.1.2), geneplotter(v.1.68.0), lubridate(v.1.7.9.2), DBI(v.1.1.1), geneLenDataBase(v.1.26.0), dbplyr(v.2.1.0), MASS(v.7.3-53.1), rappdirs(v.0.3.3), boot(v.1.3-27), ade4(v.1.7-16), readr(v.1.4.0), cli(v.2.3.1), quadprog(v.1.5-8), R.methodsS3(v.1.8.1), pkgconfig(v.2.0.3), GenomicAlignments(v.1.26.0), plotly(v.4.9.3), xml2(v.1.3.2), foreach(v.1.5.1), annotate(v.1.68.0), bslib(v.0.2.4), XVector(v.0.30.0), AnnotationForge(v.1.32.0), rvest(v.0.3.6), EBSeq(v.1.30.0), stringr(v.1.4.0), digest(v.0.6.27), graph(v.1.68.0), Biostrings(v.2.58.0), rmarkdown(v.2.7), GSEABase(v.1.52.1), directlabels(v.2021.1.13), curl(v.4.3), Rsamtools(v.2.6.0), gtools(v.3.8.2), nloptr(v.1.2.2.2), lifecycle(v.1.0.0), nlme(v.3.1-152), jsonlite(v.1.7.2), desc(v.1.2.0), viridisLite(v.0.3.0), askpass(v.1.1), limma(v.3.46.0), fansi(v.0.4.2), pillar(v.1.5.0), lattice(v.0.20-41), fastmap(v.1.1.0), httr(v.1.4.2), DEoptimR(v.1.0-8), survival(v.3.2-7), glue(v.1.4.2), zip(v.2.1.1), fdrtool(v.1.2.16), iterators(v.1.0.13), Rgraphviz(v.2.34.0), pander(v.0.6.3), bit(v.4.0.4), stringi(v.1.5.3), sass(v.0.3.1), blob(v.1.2.1), DESeq2(v.1.30.1), caTools(v.1.18.1), memoise(v.2.0.0) and dplyr(v.1.0.4)
message(paste0("This is hpgltools commit: ", get_git_commit()))
## If you wish to reproduce this exact build of hpgltools, invoke the following:
## > git clone http://github.com/abelew/hpgltools.git
## > git reset a07f461e2fb85aed84517dfa994c44b07ea25f30
## This is hpgltools commit: Sun Apr 25 20:45:25 2021 -0400: a07f461e2fb85aed84517dfa994c44b07ea25f30